With the complete human genomic sequence being unraveled, the focus will shift to gene identification and to
the functional analysis of gene products. The generation of a set of cDNAs, both sequences and physical clones, 
which contains the complete and noninterrupted protein coding regions of all human genes will provide the 
indispensable tools for the systematic and comprehensive analysis of protein function to eventually understand 
the molecular basis of man. Here we report the sequencing and analysis of 500 novel human cDNAs containing 
the complete protein coding frame. Assignment to functional categories was possible for 52% (259) of the
encoded proteins, the remaining fraction having no similarities with known proteins. By aligning the cDNA 
sequences with the sequences of the finished chromosomes 21 and 22 we identified a number of genes that 
either had been completely missed in the analysis of the genomic sequences or had been wrongly predicted. 
Three of these genes appear to be present in several copies. We conclude that full-length cDNA sequencing 
continues to be crucial also for the accurate identification of genes. The set of 500 novel cDNAs, and another 
1000 full-coding cDNAs of known transcripts we have identified, adds up to cDNA representations covering 
2%-5% of all human genes. We thus substantially contribute to the generation of a gene catalog, consisting of
both full-coding cDNA sequences and clones, which should be made freely available and will become an 
invaluable tool for detailed functional studies. 

[The sequence data described in this paper have been submitted to the EMBL database under the accession nos. 
given in Table 2.] 



The recent past has witnessed major advances in the 
determination of the sequence of the human genome 
(Dunham et al. 1999; Hattori et al. 2000). Although the 
whole genomic sequence will be completely unraveled 
in the near future (Collins et al. 1998), the
identification of genes and the deciphering of gene structures
will extend for a prolonged time; and cDNA sequences
will continue to be invaluable tools for this adventure,
especially in view of alternative splicing. The primary 
focus will shift to the functional analysis of the genes 
and their protein products to finally understand the 
molecular basis of human life. Current estimates vary 
between 29,000 and >70,000 genes to constitute the 
protein coding repertoire of the human genome (Fields 
et al. 1994; Ewing and Green 2000; Liang et al. 2000; 
Roest Crollius et al. 2000). However, thus far only some 
11,000 cDNA sequences have been deposited in public 
databases, which are supposed to contain the complete
protein coding open reading frame (ORF). The majority
of the respective cDNA clones are most likely not
accessible. The generation of a physical clone set rep- 
resenting all human genes that should be made freely 
accessible is consequently regarded to have an ex- 
tremely high impact (Schuler 1997; Pruitt et al. 2000). 
This would permit the establishment of a catalog of 
clones to provide the resources needed in the proteom- 
ics era where the functions of proteins, their action in 
pathways, and the possible disease relation are deci- 
phered. 

Until recently, the long-cDNA sequencing project 
carried out at the Kazusa Institute (Nomura et al. 1994; 
Nagase et al. 2000) had been the only systematic
full-length cDNA sequencing project with a
significant output of novel sequence information. A
new large-scale cDNA sequencing project coordinated
by the National Institutes of Health has recently
been announced (Strausberg et al.
1999). We founded a cDNA Consortium in 1997 as part 
of the German Genome Project and aim at the charac- 
terization of the complete sequences of novel human 
transcripts at the cDNA level. 

Here, we report the sequences and analysis of 500 
novel human cDNAs that all contain the complete pro- 
tein coding region. These cDNAs constitute the most 
valuable essence of 30,000 clones that have been EST 
sequenced and 3630 fully sequenced cDNAs. Over 
1000 cDNAs that cover the complete coding sequence 
of already known transcripts have been identified in 
the EST-sequenced clone set. All clones are made avail- 
able through the Resource Center of the German Ge- 
nome Project (RZPD). 

RESULTS 

Libraries and Clones 

To identify and sequence novel human cDNAs we have 
5'-EST sequenced >30,000 independent cDNA clones.
Bioinformatic evaluation of these sequences (Fig. 1) led 
to the identification of full-coding clones of already 
known proteins (>1000), and to cDNA clones lacking 
database hits, which are potential targets for full- 
length sequencing. Presumably novel cDNAs were 3'- 
EST sequenced and again analyzed for novelty. Out of 
the initial clones, 3630 cDNAs have been fully sequenced
thus far, totaling 8.8 Mb. The sequence subset
described here comprises 500 novel human cDNAs 
that are representations of the complete protein cod- 
ing part of the original transcripts. Also the other fully 
sequenced cDNAs represent mostly genes that have 
not been fully sequenced elsewhere; however, the 
clones are not likely to contain the complete protein 
coding region of the respective transcripts, or they contain
frame-shift mutations that have probably been introduced
during reverse transcription in the cloning
process. Therefore, these clones are only of reduced
value for functional analysis. The number of bases reported
for the 500 full-coding cDNAs is 1,264,620 bp;
the average insert size of the clones is 2529 bp. The
clones originate from five different cDNA libraries that
have been sampled in varying numbers of clones
(Table 1) to maximize the likelihood of identifying
novel cDNAs.

Figure 1 Flow of clones, sequences, and information in the
German cDNA Consortium. 5' EST sequences were systematically
generated from the clones of 384-well microtiter plates and analyzed
for hits in public databases. Clones with novel sequences
were 3'-EST sequenced and these ESTs were analyzed again for
novelty. Clones of uncharacterized transcripts were reported
back to the sequencers who then did the full-length sequencing
of cDNAs. The final sequence was analyzed comprehensively with
bioinformatic tools and the outputs were evaluated manually.
The clones feed functional analysis projects that take advantage
of the clone resources generated.

The calculated average size of the encoded pro- 
teins was 470 amino acid residues, which equals the 
number that has been reported previously for some 
1200 genes (Makalowski and Boguski 1998). There was, 
however, a wide variation between 66 and 1805 residues.
The cDNA identifiers, the respective sequence accession
numbers (EMBL/GenBank/DDBJ), cDNA sizes,
the length of ORFs, the chromosomal location, and 
functional details for the individual cDNAs are broken 
down in Table 2. This table is available in its entirety at 
http://www.dkfz-heidelberg.de/abt0840/GCC. 

Features of 5'- and 3'-Untranslated Regions

The 5'-untranslated regions (UTRs) averaged 148 nt,
which is in the same range as that reported previously
(Pesole et al. 1996) but considerably shorter than the 
number (215 nt) calculated in the UTRdb (Pesole et al. 
2000). There was a wide variation in size ranging up to 
>800 nt (e.g., DKFZp761F182). Even this long 5'-UTR 
was consistent with the scanning model for transla- 
tional initiation (Kozak 1999) as there was no AUG 
codon in this stretch of sequence. In-frame stop 
codons upstream from the initiator ATG were present 
in 56.4% (282) of the cDNAs. This number is consistent 
with that observed with cDNAs isolated from oligo- 
nucleotide cap ligation libraries (Suzuki et al. 2000), 
where the cDNAs have been selected to contain the 
extreme 5' ends of the respective transcripts. The over- 
all GC content in the 5'-UTRs (56.3%) was consider- 
ably higher than that in the coding regions (52.6%) 
and the 3'-UTRs (45.7%). This is consistent with the
finding that CpG islands frequently extend into the 
transcribed sequence (Cross and Bird 1995) whereas 
elements present in the 3'-UTR are often AU rich (Xu et 
al. 1997). 
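The three 5'-UTR checks discussed above (in-frame stop codons upstream of the initiator ATG, upstream ATG triplets, and GC content) can be illustrated with a minimal Python sketch. The function names and the toy sequences are hypothetical and are not taken from the Consortium's pipeline; the sketch assumes the position of the initiator ATG is already known.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc_content(seq: str) -> float:
    """Fraction of G and C nucleotides in a sequence (0.0 for an empty one)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def has_upstream_inframe_stop(cdna: str, atg_index: int) -> bool:
    """True if a stop codon lies in frame with the initiator ATG in the 5'-UTR.

    atg_index is the 0-based position of the 'A' of the initiator ATG
    (assumed to be known from the ORF prediction).
    """
    cdna = cdna.upper()
    # Walk backwards from the ATG in steps of one codon.
    for i in range(atg_index - 3, -1, -3):
        if cdna[i:i + 3] in STOP_CODONS:
            return True
    return False


def has_upstream_atg(cdna: str, atg_index: int) -> bool:
    """True if any ATG triplet (in any frame) precedes the initiator ATG."""
    return "ATG" in cdna.upper()[:atg_index]
```

For a toy sequence such as `GGTAAGCCATGGCTTGA` with the initiator ATG at index 8, the first check finds the in-frame `TAA` at index 2, and the last check reports no upstream ATG.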

The average size of the 3'-UTRs was 926 nt [not
including the poly(A) tail], which is considerably larger
than the 388 nt and 820 nt reported by Makalowski
and Boguski (1998) and Pesole et al. (1996), respectively.
This discrepancy probably derives from the
longer average size of the cDNAs described here, as 
compared with that observed in the previous studies. 
As with the 5'-UTR, there was great variability in the
size of the 3'-UTR. The translation terminator codon 
TAA could be part of the polyadenylation signal (e.g., 
in clone DKFZp564F2272) whereas in other cDNAs the 
3'-UTR was found to be >4000 nucleotides (e.g., 
DKFZp486C1218). 
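The case in which the terminator TAA forms part of the polyadenylation signal can be pictured with a short Python sketch. It assumes the canonical hexamer AATAAA as the signal (the text does not name the hexamer) and only tests positional overlap; the function name and toy sequences are hypothetical.

```python
POLYA_SIGNAL = "AATAAA"  # canonical hexamer, assumed here for illustration


def stop_within_polya_signal(cdna: str, stop_index: int) -> bool:
    """True if the three bases at stop_index lie inside an AATAAA hexamer.

    stop_index is the 0-based position of the first base of the stop codon.
    The function checks position only; it does not verify that the triplet
    at stop_index really is a stop codon.
    """
    cdna = cdna.upper()
    start = cdna.find(POLYA_SIGNAL)
    while start != -1:
        if start <= stop_index and stop_index + 3 <= start + len(POLYA_SIGNAL):
            return True
        # Continue with the next (possibly overlapping) occurrence.
        start = cdna.find(POLYA_SIGNAL, start + 1)
    return False
```

In the toy sequence `ATGGCCGAATAAAGTC` the in-frame stop codon TAA starts at index 9 and sits inside the hexamer that starts at index 7, so such a transcript would have essentially no 3'-UTR.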

We screened for the presence of repeat structures 
across the cDNA sequences. The Alu repeat family was
most frequently contained in the cDNAs; 7.6% (38) of
the cDNA inserts carried this type of repeat. L1 repeats
were present in two cDNAs; one cDNA contained both
LTR2 and Alu repeats (DKFZp761G18121). The repeat
structures were, without exception, located in the 3'- 
UTR of the respective cDNAs. However, in a number of 
other cDNAs we found repeats also in the presumed 
5'-UTRs. All of these clones turned out, upon further
analysis, to be incompletely spliced and/or partial,
with intronic sequence at the 5' ends. We therefore
reason that the presence of repeat structures in
5'-UTRs of transcripts is rather rare. The lack of repeat
structures in 5' EST sequences has since been implemented
as a criterion in the selection process of cDNA
clones that are targeted for full-insert sequencing to further
increase the impact of the project.

Functional Classification 

We grouped the cDNAs into functional classes accord- 
ing to homologies of their encoded proteins with al- 
ready known proteins (Table 2 and Fig. 2): cell cycle, 
differentiation and development, membrane protein, 
metabolism, nucleic acid management, protein man- 
agement, signaling and communication, structure and 
motility, transport and traffic, and unknown. Se- 
quence annotations in databases sometimes were mis- 
leading, and the putative function of a protein could 
not be simply deduced by regarding the hit with the 
highest similarity as being the most significant. The 
integration of results from several search algorithms 
was necessary to draw relevant conclusions. For ex- 
ample, the deduced protein sequences were evaluated 
for the presence of specific (protein) sequence patterns 
necessary for the function/activity of a particular protein
[e.g., the DFG/DWG and APE motifs had to be
present in a protein kinase, as reported by Hanks et al.
(1988)]. The results of this functional classification are
given in Table 2. The largest class constitutes proteins 
of unknown function (202 cDNAs, 41%). Considering 
that for another 39 cDNAs (8%) the only prediction 
that had been possible was that the deduced proteins 
would contain a putative transmembrane domain, no 
function could be inferred for a total of 241 (48%)
of the predicted proteins. But even if functional
predictions were possible, the identification, for example,
of a protein kinase provides information neither
on its substrates nor on the pathway(s) in which it
is involved. Comprehensive functional analyses
are specifically indicated for a set of cDNAs encoding
candidates for genes related to disease, such as
putative GTP binding proteins, ion channels, and a
cDNA encoding a protein that is highly similar to an
oncogene.

Figure 2 Functional classification of proteins encoded by the
cDNAs. The deduced proteins were grouped into 10 functional
categories based on sequence similarity with proteins of known
function. The fraction of the 500 cDNAs grouped into the respective
categories is indicated.

We further analyzed the cDNAs for the presence of 
function-related sequence motifs to also identify novel 
members of gene families. We identified 41 potential 
leucine zipper proteins (Struhl 1989), 11 proteins with 
WD-domains (Neer et al. 1994), 11 proteins with predicted
zinc finger domains (Parraga et al. 1988), 7 potential
protein kinases, and 5 RNA helicases. The respective
clones are indicated in Table 2 (column 9).
Two cDNAs (DKFZp586I021 and DKFZp434O1826)
contain both a WD-domain and a leucine zipper. A 
zinc-finger domain is predicted additionally for the de- 
duced protein of the former cDNA. 
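A motif screen of this kind can be sketched in Python with regular expressions. The leucine zipper expression follows the widely used pattern L-x(6)-L-x(6)-L-x(6)-L; the kinase expression simply requires a DFG or DWG triplet followed downstream by APE, and its allowed spacing (5 to 40 residues) is an assumption made for this illustration only. Neither expression is claimed to be the one used in the analysis described here, and pattern hits of this sort are only candidates that need manual evaluation.

```python
import re

# Leucine zipper pattern: four leucines, each separated by six residues.
LEUCINE_ZIPPER = re.compile(r"L.{6}L.{6}L.{6}L")

# Kinase motifs: DFG or DWG followed downstream by APE.
# The 5-40 residue spacing is a hypothetical choice for this sketch.
KINASE_MOTIFS = re.compile(r"D[FW]G.{5,40}APE")


def classify_motifs(protein: str) -> list:
    """Return the names of the motif classes found in a protein sequence."""
    protein = protein.upper()
    found = []
    if LEUCINE_ZIPPER.search(protein):
        found.append("leucine zipper")
    if KINASE_MOTIFS.search(protein):
        found.append("protein kinase motifs")
    return found
```

A protein can fall into more than one class, which mirrors the observation above that some deduced proteins carry several of the screened domains at once.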

Alternative Splicing 

We found 39 (7.8%) cDNAs to represent putative splice 
variants of already known transcripts. This number is 
likely to represent the lower end of the fraction of tran- 
scripts that are alternatively spliced in vivo as any 
cDNAs representing already fully-known transcripts 
were excluded from further sequencing and alternative 
splice forms should therefore be under-represented in 
our set. We found ORFs with additional exons (e.g., 
DKFZp761B192), skipped exons (e.g., DKFZp564A032), 
and alternative exons including one containing the 
translational start codon and resulting in a different N 
terminus of the deduced peptide (e.g., DKFZp434J154). 



The percentage of alternatively spliced cDNAs appeared
to be slightly higher in fetal brain: 40% of the
alternatively spliced cDNAs originate from fetal brain 
whereas only 28% of all cDNAs analyzed originate 
from this tissue. This finding is consistent with reports 
by Sutcliffe and Milner (1988) and Hanke et al. (1999). 
The presence of residual intron sequences in many
cDNA sequences available in public databases, however,
might lead to an overestimation of the extent of
alternative splicing that is taking place in vivo.
Experimental evidence will therefore be needed to confirm
presumed alternative splice forms.

Representation of cDNAs in the UniGene Data Set

Depending on the true number of human genes, about
60%-90% have already been identified by partial se- 
quencing of >2,000,000 cDNAs (EST sequencing). 
Overlapping EST sequences have been clustered to 
break down this large number of ESTs to comprehen- 
sive collections that should consist of nonredundant 
data sets having one representation (cluster) for every 
gene. The most widely accepted clustering data set is 
the UniGene (Schuler et al. 1996) resource at the NCBI 
(http://www.ncbi.nlm.nih.gov/UniGene/). This 
dataset currently consists of >90,000 clusters of mostly 
partial sequences. Consensus sequences of these clus- 
ters are available from http://www.rzpd.de. To investi- 
gate the representation of the novel cDNAs reported 
here in the UniGene data set and to evaluate the maxi- 
mum number of genes that could be represented there, 
we aligned the full-length sequences with the UniGene 
database. The version of UniGene (Build 105) that was 
used in the analysis consisted of 92,931 clusters with 
10,501 clusters containing known genes. 

In total, 626 UniGene clusters matched with 472 
out of the 500 full-coding cDNA sequences. The ma- 
jority of cDNAs (342, 68%) was represented by one 
UniGene cluster. An additional 130 (26%) cDNAs were 
represented by 284 separate UniGene clusters (Fig. 3). 
Thus, a number of UniGene clusters could be linked by 
the full-length cDNA sequences. An example of three 
UniGene clusters that were joined with one cDNA is 
given in Figure 4. We analyzed the ESTs and clusters 
that were placed internal to the cDNAs reported here 
and found that most of the EST clones making up these 
clusters had originated from internal priming events 
(mostly in residual intron sequences) and not from
alternative polyadenylation. The number of 640 clusters
that was hit with 472 cDNA sequences implies that
there is ~35% redundancy in UniGene. As the average
size of the human transcripts in general has been esti- 
mated to be in the same range as the average size of the 
cDNAs reported here (by quantification of Northern 
blots that had been hybridized with a labeled oligo- 
nucleotide dT probe; N. Nomura, pers. comm.), our 
finding should be representative. However, the true
number of genes represented in UniGene will further
condense as a considerable fraction of the UniGene
clusters are singletons (~39%), which are clusters made
up by only one cDNA, and several of these will eventually
turn out to be artifacts. Consequently, we estimate
the number of independent genes that are represented
in UniGene to be 50,000 at most.

Figure 3 Representation of cDNAs in the UniGene data set
(Build 105). Every cDNA was aligned with the UniGene data set
to identify the number of EST clusters that was hit/joined with a
given cDNA. The fraction and the total number (in parentheses)
of the cDNAs are given for the varying numbers of clusters being
hit.
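The redundancy figure quoted above can be reproduced with a few lines of Python. The sketch assumes that redundancy means the number of surplus clusters per matched cDNA; the text does not define the term, so this reading and the function name are assumptions.

```python
def cluster_redundancy(clusters_hit: int, cdnas_matched: int) -> float:
    """Surplus clusters per matched cDNA (assumed definition of redundancy)."""
    if cdnas_matched <= 0:
        raise ValueError("cdnas_matched must be positive")
    return (clusters_hit - cdnas_matched) / cdnas_matched


# Figures quoted in the text: 640 clusters hit by 472 cDNA sequences.
redundancy = cluster_redundancy(640, 472)
print(f"{redundancy:.1%}")
```

Under this reading, 640 clusters for 472 cDNAs give a value of about 0.356, in line with the roughly 35% stated in the text.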

A fraction of 6% (28 cDNAs) did not have hits in 
the UniGene database (cutoff, sequence identity >95% 
in 50 bp). The low number of the novel cDNAs without 
UniGene matches might in turn imply that >90% of all 
human genes were already represented in this data- 
base. However, we would rather assume that an un- 
known number of genes has escaped cloning and/or 
identification so far as the respective transcripts might 
be expressed only at extremely low levels or in very 
specialized cell types or differentiation stages. A proper 
selection of tissues or even single cell types for cDNA
library production will be a critical issue for the detection
and cloning also of these rarely expressed transcripts.
For example, fetal brain, although very complex
in expression, has been so deeply sampled in EST
projects [especially the IMAGE 1NIB library (Soares et
al. 1994)] but also in full-length cDNA sequencing (Nagase
et al. 2000) that the novelty rate (3 of 142 cDNAs,
2%) is rather low in this tissue. In contrast, testis currently
appears to have a higher potential for identifying
transcripts not yet covered by ESTs (19 of 204
cDNAs, 9%).

Figure 4 Three UniGene clusters are joined when aligned with
the cDNA sequence DKFZp434B0435. The bar on top of the scale
represents the cDNA with the open reading frame drawn as an
open box. The bars below the scale represent the position and
size (in bp) of the three UniGene clusters that are joined by the
cDNA sequence. The accession nos. of representative sequences
of the respective UniGene clusters are given below the bars.
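The per-library novelty rates quoted above are simple ratios, as the following Python sketch shows. The counts are those given in the text; the function and variable names are hypothetical.

```python
def novelty_table(libraries: dict) -> dict:
    """Map each library to the percentage of its cDNAs without UniGene hits.

    libraries maps a library name to a tuple
    (cDNAs without hits, cDNAs analyzed from that library).
    """
    return {
        name: round(100 * without_hits / total)
        for name, (without_hits, total) in libraries.items()
    }


# Counts quoted in the text.
rates = novelty_table({"fetal brain": (3, 142), "testis": (19, 204)})
print(rates)
```

The same ratio applied to the whole set (28 of 500 cDNAs without UniGene hits) gives the overall figure of about 6%.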

Tissue Specificity of Expression 

To analyze for a possible tissue specificity of expression 
we aligned the cDNA sequences with the EST database 
dbEST. ESTs originating from pooled tissues and tissues 
with unclear origin were excluded. Each cDNA re- 
ceived a score indicating the degree of tissue specific- 
ity. The higher this score, the higher the likelihood 
that expression of the particular transcript should be 
restricted to that tissue. A ubiquitously expressed tran- 
script would have had a score of one. Only cDNAs with 
scores of five or higher are indicated in Table 2 (col- 
umns 10-12). In total, the expression of 22 transcripts 
appeared to be restricted to only one tissue with 
matching tissues of our cDNA and the ESTs (Table 2). 
Six brain-derived cDNAs only matched ESTs that had 
derived from brain tissues. Most of the cDNAs encode 
proteins that are either involved in the cell cycle or 
signaling pathways, for example, a stathmin-like pro- 
tein and a protein similar to a calmodulin-binding pro- 
tein. Only one of the six cDNAs encodes a protein of 
unknown function. Another 15 testis cDNAs had hits 
only with ESTs from testis/male genital tract. Although 
predictions could be made for three of the encoded 
proteins (a predicted sperm flagellar protein, a putative 
neurotransmitter transporter, and a possible nuclear 
pore protein), the other 12 cDNAs encode proteins of 
unknown function. The only uterus cDNA predicted to 
be specifically expressed in uterus/ovary encodes a pu- 
tative chaperone-associated protease, which could in- 
dicate that this protein might be involved in the dif- 
ferentiation of the egg or embryo. The expression of 
several testis-derived transcripts appeared to be very 
selective as the scores calculated for these cDNAs were 
rather high, compared with scores obtained with other 
cDNAs and tissues (Table 2). This also matches the ob- 
servation that the novelty rate, counting cDNAs with- 
out EST hits, was highest in the testis library (see 
above). 
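The scoring formula itself is not given in the text. One hypothetical way to obtain a score with the stated property (a ubiquitously expressed transcript scores one, a tissue-restricted transcript scores higher) is an enrichment ratio of observed to expected EST hits, sketched below in Python. This is an assumed stand-in for illustration and not the scoring scheme actually used for Table 2.

```python
def tissue_specificity_score(hits_in_tissue: int, hits_total: int,
                             ests_in_tissue: int, ests_total: int) -> float:
    """Enrichment of a transcript's EST hits in one tissue (hypothetical score).

    observed = share of the transcript's EST hits that come from the tissue
    expected = share of all ESTs in the database that come from the tissue
    A transcript whose hits mirror the database composition scores 1.0.
    """
    if hits_total <= 0 or ests_in_tissue <= 0 or ests_total <= 0:
        raise ValueError("counts must be positive")
    observed = hits_in_tissue / hits_total
    expected = ests_in_tissue / ests_total
    return observed / expected
```

With this definition, a transcript whose eight EST hits all come from a tissue that contributes 10% of the database would score about 10, while a transcript whose hits follow the database composition would score 1.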

cDNAs Mapping to Human Chromosomes 21 and 22 

To demonstrate the power of mapping genes by align- 
ing cDNA with genomic sequences we downloaded the 
sequences of the first two completely sequenced human
chromosomes 21 (Hattori et al. 2000) and 22
(Dunham et al. 1999) and aligned them with those
novel cDNAs mapping to the respective chromosomes 
(Table 3). Clone identifiers of the respective cDNAs and 
the insert and ORF sizes are provided in the first three 
columns. For ORF sizes (column 3) the predicted num- 
ber of amino acid residues is given first, followed by the 
number of the residues deduced from the cDNA se- 
quence; a dash (-) is inserted for proteins that were not 
predicted. The predicted localization as based on 
mainly STS data is given in the fourth column, fol- 
lowed by the exact localization of the genes (gene locus 
in bp as defined in the published sequences of chro- 
mosome 21, http://hgp.gsc.riken.go.jp, and chromo- 
some 22, http://www.sanger.ac.uk/cgi-bin/cwa/ 
22cwa.pl). The accession numbers of the genomic 
clone(s) covering the genes, identifiers of predicted 
transcripts (if available; dashes indicate nonpredicted 
genes), the number of predicted exons out of the num- 
ber of identified exons (based on cDNA sequence), and 
the number of UniGene clusters that were hit with the 
respective cDNAs are given in columns 6-9. 

Whereas 13 of the novel cDNAs map to chromo- 
some 22, only two cDNAs map to chromosome 21. 
This could either be a reflection of the generally higher 
gene content of chromosome 22 (554 compared with 
the 225 predicted genes on chromosome 21) or be a 
result of the fact that the percentage of genes that had 
been known previously is higher for chromosome 21 
(this chromosome had long been carefully investigated 
because of its clinical implications, e.g., in Down syn- 
drome). A third explanation could be a correlation be- 
tween chromosomal location and global expression 
levels of the individual genes, as has been proposed by 
Ewing and Green (2000), with genes mapping to chro- 
mosome 21 in general possibly being expressed at 
lower levels compared with genes located on chromo- 
some 22. 

By combining the genomic and cDNA data, the 
exact gene structures of all 15 cDNAs could be determined.
Although all cDNAs were covered by UniGene
clusters, only 8 of the 15 genes had been predicted 
from the genomic sequence. Most of these gene pre- 
dictions were precise, identifying the majority or all 
exons. The number of amino acid residues varied in 
most cases only marginally from the number deduced 
from the cDNA sequence. However, one cDNA 
(DKFZp564B212) merged three predicted transcripts to 
only one gene and overlapped another gene 
(bK445C9.C22.3) predicted on the opposite strand. In 
total, seven genes had completely failed to be pre- 
dicted, some of which encode rather large ORFs and 
consist of several exons. 

The mapping information that is based on ge- 
nomic sequence not only gives the exact localization 
of individual genes but also provides information on 
the context of these genes in view of neighboring 



genes (e.g., DKFZp434B194 and DKFZp564B212 are 
only 13 kb apart) and the presence of probable addi- 
tional gene copies. For example, the genes of cDNAs 
DKFZp434N035 and DKFZp434P211 appear to be pres- 
ent on chromosome 22 in 2 and 9 highly similar copies 
(>85% sequence identity on nucleotide level), respec- 
tively. DKFZp434P211 could indicate a cluster of 
highly similar POM121 related genes (Fig. 5), the first 
of which was described by Kawasaki et al. (1997). Two 
copies (2850458 and 2871777) seem to be ancient and 
inactive as they are incomplete, contain several frame 
shifts, and share only 89% and 87% sequence identity 
with the cDNA sequence in exon 1, respectively. The 
other copies are highly similar (>95% identity on 
nucleotide level). Further experiments will be neces- 
sary to investigate how many of the gene copies are 
expressed and to explain the presence of the stop 
codon at position 429 in three of the gene copies (and
in the cDNA) but a sense codon in this position in four 
other gene copies, possibly leading to an extended pro- 
tein product. EST evidence is available for transcripts of 
both types of genes (e.g., for copies 5055694 and 
8220566). 
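The identity values quoted for the gene copies can be computed from a pairwise alignment as in the following Python sketch. It assumes that alignment columns containing a gap in either sequence are left out of the comparison; the text does not say how gaps were treated, so this convention and the function name are assumptions.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identical positions between two aligned sequences.

    Both inputs must come from the same alignment (equal length, '-' for
    gaps). Columns with a gap in either sequence are skipped, which is an
    assumed convention for this sketch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0
```

Applied to exon 1 of a genomic copy aligned against the cDNA, a result below about 90% would correspond to the two copies described above as ancient and probably inactive.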

DISCUSSION 

The considerable fraction of genes that were not pre- 
dicted in the analysis of the chromosome 21 and 22 
sequences was somewhat surprising, as EST data and 
UniGene clusters (Table 3) were also available for these 
genes. Three of the genes that were not predicted even 
appear to be present in more than one copy on the 
same chromosome, namely, within 6 Mb on chromo- 
some 22. But even if all genes could be identified via 
bioinformatic procedures, the alternative use of exons 
and promoters (alternative splicing) constitutes a prob- 
lem that cannot currently be solved with knowledge of 
the genomic sequence alone. Consequently, only the 
availability of cDNA sequences enables us to define the 
precise protein coding parts of the genome and, in con- 
junction with the genomic counterpart, to also define 
the composition of exons in alternatively spliced tran- 
scripts of the same gene. Both the sequence and the 
chromosomal location of genes are important pieces of 
information supportive also in the process of defining 
and analyzing candidate disease genes. 

Most of the genome has been unraveled as draft 
sequence, where sequence submissions of individ- 
ual genomic clones are released in several contigs 
of varying length. These contigs are usually not 
ordered relative to one another. However, automated 
assembly and annotation tools like GoldenPath 
(http://genome.ucsc.edu/goldenPath/hgTracks.html) 
try to overcome this problem and prove to be extremely
helpful for the mapping of cDNAs. The availability
of cDNA sequences in turn immediately helps
to identify the genes that are located on the respective
genomic clones, to support the ordering of the draft
sequence contigs, and to narrow down the regions
where putative regulatory elements should reside.
Thus, cDNA and genomic sequences are complementary
and synergistically add information. The BLAST
analysis of cDNAs and matching genomic sequences
showed that only 32 cDNAs did not have corresponding
genomic matches (not covered, NC in Table 2, column 5),
which is the number expected because >91%
of the genomic sequence is reported to be unraveled.
The chromosomal localization could be approximated
for 449 cDNAs using the GoldenPath web browser; 21
BACs had not been mapped (NM). The accession numbers
of these BACs are provided in column 5 of Table 2.
The combination of genomic and cDNA sequence provides
the gene structures with precise exon-intron
boundaries and defined intron sequences.

Figure 5 Multiple sequence alignment of cDNA DKFZp434P211 with POM121-related 1 (accession no. D87002) and
sequences from chromosome 22 demonstrates the presence of a cluster of POM121-related genes. The individual genomic
sequences were named after the start of the first exon relative to the cDNA. The open reading frame (ORF) was defined
according to the predicted protein of the cDNA and of POM121-related 1. Genes located on the plus and minus strands
of chromosome 22 are indicated with + and -, respectively. The cDNA sequence of DKFZp434P211 was taken as reference;
identical residues in other sequences are indicated with a dot, residues deviating from the consensus are printed. Asterisks
(*) indicate stop codons. The genomic sequences 2850458 and 2871777 are in italics because these copies deviate from the
other copies by a premature stop or frame shifts and a large insertion, respectively, and are probably not expressed. In these
two gene copies the initiator ATG is mutated. Dashes (-) were inserted by the software (Clustal) to optimize the
alignment.

Furthermore, it will become increasingly impor- 
tant to not only have the human genes identified but 
rather to characterize the precise functions of the encoded
proteins and also the functions of those transcripts
that are not translated. To this end, full-coding
cDNA representations are indispensable tools, for example,
for the subcloning of exactly defined ORFs into
expression vectors. However, currently only ~11,000
nonredundant cDNA sequences that are supposed to
contain the complete protein coding ORF have been
deposited in public databases. An even lower number
of these full-coding ORFs can be obtained as cDNA 
clones through commercial or noncommercial provid- 
ers (e.g., ATCC, Genome Systems, Research Genetics, 
HGMP, Resource Center of the German Genome 
Project) and would thus be available for functional re- 
search. 

Recently, the range of estimates given for the 
number of human genes has evolved to the lower end, 
because in two calculations only ~35,000 human genes
have been predicted (Ewing and Green 2000; Roest
Crollius et al. 2000). Our data would also hint at a
lower than previously expected number, as we would
estimate the number of genes currently represented in
UniGene to be 50,000 at most. Still, the real number of
human genes needs to be established by further cDNA
and also by comparative genomic sequencing (e.g., of
the mouse). If it should hold true, however, that the
number of genes in human was indeed only about twofold
higher than the ~18,000 genes that have been predicted
for Caenorhabditis elegans by The C. elegans Sequencing
Consortium (1998), the question would arise
as to where the difference in complexity between these 
two life forms originated. Because the sheer doubling 
of gene number would not be likely to account for all 
differences, the comprehensive analysis of gene and 
protein function(s) would become an even greater 
problem. This is because one solution to this apparent 
paradox could be the acquisition of multiple functions 
by many of the proteins expressed in human. This 
would add another order of complexity to the line 
starting with the genome and continuing through the 
transcriptome with alternative splicing, the proteome 
with post-translational modifications, and finally (?) to 
a 'functiome,' which would cover the acquisition of 
diverse functions by the same protein depending on its 
cellular and subcellular environment. Several examples 
of such multiple usages of proteins have already been 
described (Jeffery 1999).

In the set of 500 novel cDNAs described here, only 
about half of the deduced proteins could be function- 
ally classified, while identification, for example, of a 
protein kinase does not provide information on sub- 
strates or pathways in which this protein is involved. 
Additionally, half of the predicted proteins remain 
without any hint as to their possible function. With 
this in mind, the establishment of a gene catalog,
which will eventually contain a nonredundant set of
full-coding cDNA sequences and clones covering every
human gene, is a prerequisite for carrying out the experiments
needed to precisely identify protein
function(s). This catalog should be the result of a global
enterprise integrating the data and clones from as 
many projects and researchers as possible and could be 
an extension of already existing databases such as 
GeneCards (Rebhan et al. 1998) and RefSeq (Pruitt et 
al. 2000) with, for example, links to the clone providers 
mentioned above. In addition to the novel full-coding 
cDNA sequences and clones described here, we have 
identified over 1000 cDNAs which comprise full- 
coding representations of previously known genes. In 
combination, these cDNAs represent 2%-5% of all hu- 
man genes and will thus be a substantial part of the 
catalog and be ideal tools to carry out functional analy- 
ses. Although the 500 novel cDNAs have been fully 
sequenced and can be directly used in functional 
analysis, the cDNAs representing known genes need 
further characterization because these are not fully 
sequenced. To this end, we amplify the ORFs from 
these cDNAs and verify the predicted size. These
ORFs are then cloned into a bacterial expression
vector which contains an N-terminal fusion with the
GFP. As the Gateway system (Life Technologies) is
employed in the cloning process, the ORFs can be 
shuttled into any expression vector (Simpson et al. 
2000). Only intact reading frames (no PCR frame shifts, 
no introns, no frame shifts in the clone) lead to fluo- 
rescent colonies as the ORF extends uninterrupted into 
the GFP. The Gateway entry clones of the verified 
genes are also made available through the Resource 
Center. 

To address the systematic functional analysis of 
the novel proteins, a large-scale project dealing with 
the subcellular localization and functional analysis of 
the proteins encoded by newly identified cDNAs re- 
ported here is underway (Simpson et al. 2000). Thus, 
the gene catalog in upcoming years will form the basis 
for the large-scale and comprehensive functional 
analysis of human genes and proteins, which is crucial 
to understand the basis of human life, disease, and 
death. 

METHODS 

Library Construction 
SMART Libraries 

The DKFZp564 (human fetal brain) and DKFZp566 (human 
fetal kidney) libraries were generated using the SMART kit 
(Clontech). PCR amplification of the cDNA was necessary to 
obtain enough cDNA for cloning. The first-strand primer
contained the KS sequence of the pBluescript vector (Stratagene)
and any base but T (IUB code = V) in the 3'-terminal position
of the primer [TCGAGGTCGACGGTATCGATAAG(T)19V].
Amplification of the primary cDNA with Amplitaq (Perkin 
Elmer) and Pfu (Stratagene) DNA polymerases in a ratio of 
19/1 (vol/vol) was carried out with primers that contained
uracil residues (3' primer: CAUCAUCAUCAUCGAGGTCGACGGTATCGATAAG;
5' primer: CUACUACUACUATACGCTGCGAGAAGACGACAGAA)
and that were compatible with
the pAMP1 (Life Technologies) cloning sites for directional
cloning. Prior to cloning, the cDNA was size fractionated on 
an agarose gel. Fragments >2 kb were excised and extracted 
from the gel using GELase (Epicentre). Cloning was done using
uracil DNA glycosylase (UDG, Life Technologies) and chemically
competent bacterial cells (XL-2 Blue, Stratagene).
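
The anchored first-strand primer notation used above, adapter + (T)19 + V, stands for a mixture of three oligonucleotides, because the IUB code V means A, C, or G. The following minimal sketch expands that notation; the function and variable names are illustrative assumptions and are not part of the published protocol.

```python
# IUB code V stands for "A, C, or G" (any base but T).
IUB_V = ("A", "C", "G")


def expand_anchored_primer(adapter: str, t_run: int = 19) -> list[str]:
    """Return the concrete oligos encoded by adapter + (T)n + V.

    `adapter` is the 5' adapter portion of the primer; `t_run` is the
    length of the oligo(dT) stretch (19 in the text).
    """
    return [adapter + "T" * t_run + base for base in IUB_V]


# Adapter portion of the SMART first-strand primer as quoted in the text.
KS_ADAPTER = "TCGAGGTCGACGGTATCGATAAG"

primers = expand_anchored_primer(KS_ADAPTER)
for primer in primers:
    print(primer)
```

Because the last base is never T, each oligo can only prime at the junction between the poly(A) tail and the transcript body, rather than at arbitrary positions within the tail.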

Conventional Libraries 

The DKFZp434 (human adult testis), DKFZp586 (human adult 
uterus), and DKFZp761 (human adult amygdala) libraries 
were generated using conventional approaches (Gubler and 
Hoffman 1983), employing a NotI-dT V primer for first-strand
synthesis [GAGCGGCCGC(T)19V]. After second-strand synthesis,
SalI adapters were ligated to the blunted cDNA. Then
the cDNA was cut with NotI to generate SalI-NotI-compatible
ends at the 5' and 3' ends of the cDNA, respectively, to allow
directional cloning. The cDNAs were then size-selected on
agarose gels in two dimensions and cloned into pSPORT1 precut
with SalI and NotI (Life Technologies).

Availability of cDNA Libraries and Clones 

All libraries have been arrayed into 384-well microtiter plates
and spotted on high-density nylon membranes. Each library 
consists of 27,000 clones or multiples thereof. High-density 
clone filters and individual clones are available through the 
Resource Center of the German Genome Project
(http://www.RZPD.de; clone@rzpd.de).

Selection of Clones for Sequencing 

First, 5' ESTs were systematically generated from all clones of 
384-well microtiter plates. The sequences were analyzed with 
blastn (Altschul et al. 1990) and blastx (Gish and States
1993) against EMBL, PIR, SWISSPROT, and TREMBL databases 
for the lack of identical (>95% identity over 50 bp) matches 
with known cDNAs, and for the presence of ORFs. 
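
The identity criterion used in this first selection step can be sketched as a simple filter over database hits. This is a minimal illustration only: the `Hit` record, its field names, and the reading of ">95% identity over 50 bp" as "identity above 95% across an alignment of at least 50 bp" are assumptions, and the check for the presence of ORFs is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One database match for an EST (hypothetical record layout)."""
    subject_id: str
    percent_identity: float   # identity within the aligned region, in percent
    alignment_length: int     # length of the aligned region, in bp


def is_known_match(hit: Hit, min_identity: float = 95.0, min_length: int = 50) -> bool:
    """True if the hit counts as an 'identical' match to a known cDNA."""
    return hit.percent_identity > min_identity and hit.alignment_length >= min_length


def is_candidate_novel(hits: list[Hit]) -> bool:
    """A clone stays in the pipeline only if none of its hits is a known match."""
    return not any(is_known_match(h) for h in hits)


# Made-up hits, for illustration only.
example_hits = [
    Hit("entry_A", 88.0, 120),   # long but too divergent
    Hit("entry_B", 99.0, 30),    # near-identical but too short
]
print(is_candidate_novel(example_hits))
```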

Clones with novel sequences were 3' end sequenced. 
These 3' ESTs were checked for the lack of matches with
known genes in public databases, for repeat structures, and for 
the presence of polyadenylation signals. Clones matching the 
selection criteria were subjected to full-length sequencing. 

Sequencing Methodology and Strategy 

Sequencing was done preferentially using dye terminator 
chemistry (Applied Biosystems or Amersham) on ABI 377 au- 
tomated DNA sequencers; one partner used EMBL prototype 
instruments (Wiemann et al. 1995) mainly with dye primer 
chemistry. Primer walking (Strauss et al. 1986) was the pre- 
ferred sequencing strategy for the full-length sequencing of 
cDNAs. Design of walking primers was done preferentially 
using software (e.g., Schwager et al. 1995; Haas et al. 1998) 
that permitted the complete automation of this usually
time-consuming process and thus helped in the parallel processing
of large numbers of clones. 

Bioinformatic Analysis 

Every complete cDNA sequence was compared with the sequences
in EMBL, EMBL-EST, and EMBL-STS using BLASTN
(Altschul et al. 1990). Searches against EMBL were done to
determine whether the cDNAs were already known and to 
identify any genomic sequence information available that 
would cover the respective genes. Searches against EMBL-EST
were performed to analyze for the abundance of transcripts,
to obtain information on a possible tissue specificity of ex- 
pression, and to identify putative alternative splice forms or 
alternative use of polyadenylation signals. The annotations 
on the source tissue of the respective EST clones were parsed 
from the database entries to calculate the real ratio versus the 
expected ratio of expression according to the equation: (# hits 
tissue/total # hits)/(# ESTs tissue/total # ESTs). A gene that was 
transcribed at a constant level in many tissues would have a 
ratio of one. Significantly higher or lower ratios would indicate
increased or decreased levels of transcription in the tissue,
respectively. To identify tissue-specific expression, more than
four ESTs matching the respective cDNA had to have been
sequenced from a given tissue, and the
cutoff for the ratio of overexpression was set to five. ESTs
originating from pooled tissues or that were of unspecified 
origin were disregarded in this analysis. To obtain chromo- 
somal mapping information, the sequences were aligned with 
the EMBL-STS database. 
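
The expression ratio defined above, together with the two thresholds for tissue specificity, can be written out as follows. This is a minimal sketch: the function and parameter names are illustrative, and treating the cutoff of five as inclusive (ratio >= 5) is an assumption, since the text does not say whether the boundary value itself qualifies.

```python
def expression_ratio(hits_tissue: int, hits_total: int,
                     ests_tissue: int, ests_total: int) -> float:
    """(# hits tissue / total # hits) / (# ESTs tissue / total # ESTs).

    hits_*: ESTs matching the cDNA (in the tissue / overall).
    ests_*: all ESTs in the database (from the tissue / overall).
    ESTs from pooled or unspecified tissues should be excluded beforehand.
    """
    if hits_total <= 0 or ests_tissue <= 0 or ests_total <= 0:
        raise ValueError("totals and tissue EST count must be positive")
    observed = hits_tissue / hits_total
    expected = ests_tissue / ests_total
    return observed / expected


def is_tissue_specific(hits_tissue: int, hits_total: int,
                       ests_tissue: int, ests_total: int,
                       ratio_cutoff: float = 5.0) -> bool:
    """More than four matching ESTs from the tissue and ratio >= cutoff (assumed inclusive)."""
    if hits_tissue <= 4:
        return False
    ratio = expression_ratio(hits_tissue, hits_total, ests_tissue, ests_total)
    return ratio >= ratio_cutoff


# Made-up counts: 6 of 10 matching ESTs come from a tissue that
# contributes 5% of all ESTs in the database.
print(is_tissue_specific(6, 10, 50_000, 1_000_000))
```

A transcript whose share of hits in a tissue equals that tissue's share of the whole EST database yields a ratio of one, matching the "constant level" case described in the text.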

The potential protein sequences were identified by a
search for the longest ORF in each of the three forward frames 
with a minimum length of 90 codons. The deduced protein 
sequences were searched against the nonredundant protein 
data set of PIR, SWISSPROT, and TREMBL [BLASTP, using the
SEG filter by Wootton (1994)]. Any cDNAs without an ORF >90
codons were analyzed with blastn against TREMBL to iden- 
tify even shorter ORFs present. 
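
The ORF search described above (longest ORF in each of the three forward frames, minimum length 90 codons) can be sketched as below. The text does not define how an ORF is delimited, so this sketch assumes an ORF runs from an ATG to the next in-frame stop codon and ignores ORFs that reach the end of the sequence without a stop; the function names are illustrative and this is not the software used by the authors.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_in_frame(seq: str, frame: int):
    """Return (start, end, n_codons) of the longest ATG..stop ORF in one frame.

    `start` is the 0-based index of the ATG, `end` is the index just past
    the stop codon, and `n_codons` excludes the stop. Returns None if the
    frame contains no complete ORF.
    """
    seq = seq.upper()
    best = None
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None:
            if codon == "ATG":
                start = i            # first ATG after the previous stop
        elif codon in STOP_CODONS:
            n_codons = (i - start) // 3
            if best is None or n_codons > best[2]:
                best = (start, i + 3, n_codons)
            start = None             # look for the next ORF in this frame
    return best


def candidate_orfs(seq: str, min_codons: int = 90):
    """Longest ORF per forward frame, keeping those with >= min_codons codons."""
    found = []
    for frame in range(3):
        orf = longest_orf_in_frame(seq, frame)
        if orf is not None and orf[2] >= min_codons:
            found.append((frame, *orf))
    return found


# Synthetic sequence: a 96-codon ORF in frame 2, flanked by short UTR stubs.
demo = "CC" + "ATG" + "GCT" * 95 + "TAA" + "GG"
print(candidate_orfs(demo))
```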

blastx searches were performed against a nonredun- 
dant protein database comprising PIR, SWISSPROT, and 
TREMBL. The SEG-filter was used to screen for potential frame 
shifts in the coding sequences of the cDNAs and to identify 
cDNAs that were not fully spliced or were alternatively
spliced. The protein sequence was then transferred to pedant 
(Frishman and Mewes 1997). pedant performed automated 
database searches: psiBLAST (Altschul et al. 1997), an iterated
profile search procedure; hmmer (Sonnhammer et al. 1997), a 
Hidden Markov model software which uses statistical descrip- 
tions of a sequence family's consensus; and blimps (Wallace 
and Henikoff 1992) for similarity searches against the 
BLOCKS (Henikoff et al. 2000) database. PROSITE protein se- 
quence patterns were identified by ProSearch (Kolakowski et 
al. 1992). clustal-w (Thompson et al. 1994) was used for 
multiple sequence alignments of DNA and proteins. Trans- 
membrane regions were identified by ALOM2 (Klein et al. 
1984), and signal peptides in secreted proteins by signalp 
(Nielsen et al. 1997). seg (Wootton and Federhen 1993) has 
been employed to detect low-complexity regions in protein 
sequences and coils (Lupas et al. 1991) for the detection of 
coiled coils. For the functional classification of the cDNAs,
sequence identities with E-values <10E-30 (blastn) and
<10E-10 (blastx) were accepted to be significant. The
comprehensive bioinformatic data on all cDNAs analyzed by the
Consortium are accessible at
http://www2.mips.biochem.mpg.de/proj/cDNA/index.html. Mapping of the cDNAs to
chromosomes was done first by blast analysis of the cDNA
sequences against the human genomic sequence (NCBI-htgs
database), followed by identifying the mapping position with
help of the GoldenPath (Jim Kent, UCSC) browser
(http://genome.ucsc.edu/goldenPath/hgTracks.html).

Availability of Clones and Further Information 

All clones described here, and the other clones analyzed by
the German cDNA Consortium, are available from the Resource
Center of the German Genome Project (http://www.rzpd.de;
clone@rzpd.de). The comprehensive bioinformatic
data on all cDNAs analyzed by the Consortium are accessible
at http://www2.mips.biochem.mpg.de/proj/cDNA/index.html.
Additional information about the analysis of the
described set of cDNAs is available at
http://www.dkfz-heidelberg.de/abt0840/GCC. The full version of Table 2 can
be obtained at this location in Excel, tab-delimited text, and
pdf formats.